Generate gene reference files #47

j23414 · 2024-05-08T00:41:48Z

Description of proposed changes

In order to support gene phylogenetic trees (e.g. E gene trees), add rules to automatically generate gene reference GenBank and FASTA files (e.g. reference_denv4_E.gb and reference_denv4_E.fasta) by following the rules used in RSV.

This is part of a larger and older issue of creating E gene builds, which is being split into smaller PRs to keep QC and review scope manageable. This PR does not generate an E gene phylogenetic tree; subsequent PRs will build on it to generate the trees.

Visual summary (view whole pipeline plan so far)

Related issue(s)

Issue: Add E gene builds #17
Earlier PR: Add E gene trees #18
Related PR in RSV: Allows for CDS (as well as gene) features to generate a new gene reference rsv#55

Checklist

- Checks pass
- The phylogenetic workflow for whole genome still works (GitHub Action "phylogenetic")
  - all | denv1 | denv2 | denv3 | denv4
- Valid E gene reference files are produced with a manual call to:

nextstrain build phylogenetic results/config/reference_all_E.gb results/config/reference_all_E.fasta
nextstrain build phylogenetic results/config/reference_denv1_E.gb results/config/reference_denv1_E.fasta
nextstrain build phylogenetic results/config/reference_denv2_E.gb results/config/reference_denv2_E.fasta
nextstrain build phylogenetic results/config/reference_denv3_E.gb results/config/reference_denv3_E.fasta
nextstrain build phylogenetic results/config/reference_denv4_E.gb results/config/reference_denv4_E.fasta
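The five calls above share one naming pattern, so they can be generated instead of typed out. This is a minimal sketch that assumes the `results/config/reference_{serotype}_{gene}` pattern seen in those commands; the helper names `reference_targets` and `build_command` are hypothetical and not part of the workflow.

```python
import shlex

# Serotype labels taken from the checklist above.
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def reference_targets(serotype: str, gene: str = "E") -> list[str]:
    """Return the GenBank and FASTA target paths for one serotype/gene pair."""
    stem = f"results/config/reference_{serotype}_{gene}"
    return [f"{stem}.gb", f"{stem}.fasta"]


def build_command(serotype: str, gene: str = "E") -> str:
    """Format the `nextstrain build` call that requests both targets."""
    args = ["nextstrain", "build", "phylogenetic", *reference_targets(serotype, gene)]
    return shlex.join(args)


if __name__ == "__main__":
    # Print one command per serotype; nothing is executed here.
    for serotype in SEROTYPES:
        print(build_command(serotype))
```

The sketch only prints the commands, so it is safe to run outside the repository.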

Example shortened reference_denv2_E.gb

LOCUS       DENV2/THAILAND/REFERENCE/1964 1485 bp    DNA              UNK 01-JAN-1980
DEFINITION  Dengue virus 2, complete genome.
ACCESSION   NC_001474
VERSION     NC_001474.2
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     CDS             1..1485
                     /gene="E"
                     /db_xref="VBRC:35921"
                     /product="envelope protein E"
                     /protein_id="NP_739583.2"
     source          1..1485
                     /collection_date="1964"
                     /country="Thailand"
                     /db_xref="taxon:11060"
                     /mol_type="genomic RNA"
                     /organism="Dengue virus 2"
                     /strain="16681"
ORIGIN
        1 atgcgttgca taggaatgtc aaatagagac tttgtggaag gggtttcagg aggaagctgg
       61 gttgacatag tcttagaaca tggaagctgt gtgacgacga tggcaaaaaa caaaccaaca
      121 ttggattttg aactgataaa aacagaagcc aaacagcctg ccaccctaag gaagtactgt
      ...
     1381 gtcattatca catggatagg aatgaattca cgcagcacct cactgtctgt gacactagta
     1441 ttggtgggaa ttgtgacact gtatttggga gtcatggtgc aggcc
//
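A quick sanity check for a generated gene reference is that the CDS spans the whole sequence and that the length is a multiple of three. The sketch below does this with the standard library only, assuming the flat layout shown in the example; the function names and `TOY_RECORD` are made up for illustration, and a real check would more likely rely on a GenBank parser such as Biopython. It expects a complete record, so the shortened example above (with its `...` line) would not pass.

```python
import re

# Made-up 9 bp record in the same layout as the example above.
TOY_RECORD = """LOCUS       TOY 9 bp    DNA
FEATURES             Location/Qualifiers
     CDS             1..9
                     /gene="E"
ORIGIN
        1 atgcgttgc
//
"""


def origin_sequence(genbank_text: str) -> str:
    """Concatenate the bases listed between ORIGIN and the closing //."""
    bases = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the leading position number and the spaces between blocks.
            bases.append(re.sub(r"[\d\s]", "", line))
    return "".join(bases)


def cds_span(genbank_text: str) -> tuple[int, int]:
    """Return the 1-based (start, end) of the first simple CDS location."""
    match = re.search(r"^\s+CDS\s+(\d+)\.\.(\d+)\s*$", genbank_text, re.MULTILINE)
    if match is None:
        raise ValueError("no simple CDS location found")
    return int(match.group(1)), int(match.group(2))


def check_gene_reference(genbank_text: str) -> None:
    """Raise ValueError unless the CDS covers the whole in-frame sequence."""
    seq = origin_sequence(genbank_text)
    start, end = cds_span(genbank_text)
    if (start, end) != (1, len(seq)):
        raise ValueError(f"CDS {start}..{end} does not span 1..{len(seq)}")
    if len(seq) % 3 != 0:
        raise ValueError(f"sequence length {len(seq)} is not a multiple of 3")
```

Calling `check_gene_reference(TOY_RECORD)` returns silently; a record whose CDS location is shorter than the sequence raises `ValueError`.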

phylogenetic/bin/newreference.py

phylogenetic/config/reference_denv2_genome.gb

phylogenetic/rules/prepare_sequences.smk

This is a fixup to an earlier commit, 8cd6a13. It updates the docs to reflect that the script does not merely emit a warning but exits with an error if the gene is not found in the GenBank file. This was flagged in comment: nextstrain/dengue#47 (comment)

j23414 · 2024-05-08T23:02:23Z

I was wondering why the CI was taking so long, then remembered that the example files get connected to "phylogenetic/data".

https://github.com/nextstrain/.github/blob/4f41fa6db826dff3f1eb09f8d2e0a1512c9e358d/.github/workflows/pathogen-repo-ci.yaml#L236-L237

Fixed with: 30b1d5a
CI seems much faster

j23414 force-pushed the generate-gene-reference-files branch from 58099ae to 4102012 on May 8, 2024 16:30

j23414 requested a review from a team May 8, 2024 18:01

genehack reviewed May 8, 2024

phylogenetic/bin/newreference.py

phylogenetic/config/reference_denv2_genome.gb

phylogenetic/rules/prepare_sequences.smk

j23414 mentioned this pull request May 8, 2024

docs: fix documentation for newreference.py to reflect --gene behavior nextstrain/rsv#60

Merged

1 task

genehack approved these changes May 8, 2024

j23414 and others added 9 commits May 8, 2024 16:19

Copy newreference script from RSV

8bab02b

https://github.com/nextstrain/rsv/blob/a1788ce2c9c4375fb5a06d1426c64c45cf90225f/scripts/newreference.py

fixup: fix comments to match behavior

Co-authored-by: John SJ Anderson

fixup: file permissions

c725b31

Use wildcard constraints for serotype-gene combinations

3103591

Adds some wildcard constraints on serotype-gene combinations to avoid unchecked wildcard matching, such as having {serotype}.fasta match both "denv1_E.fasta" and "denv1.fasta".
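The ambiguity described above can be reproduced with plain regular expressions, since Snakemake matches an unconstrained wildcard with the regex `.+`. This is a minimal sketch; the constraint `all|denv[1-4]` is an assumption chosen for illustration, and the values the PR actually uses may differ.

```python
import re

# Snakemake matches an unconstrained wildcard with the regex ".+".
unconstrained = re.compile(r"(?P<serotype>.+)\.fasta")

# Hypothetical constraint; the values the PR actually uses may differ.
constrained = re.compile(r"(?P<serotype>all|denv[1-4])\.fasta")


def matched_serotype(pattern: re.Pattern, filename: str):
    """Return the serotype captured from filename, or None if no full match."""
    match = pattern.fullmatch(filename)
    return match.group("serotype") if match else None


for name in ("denv1.fasta", "denv1_E.fasta"):
    print(name, matched_serotype(unconstrained, name), matched_serotype(constrained, name))
```

With the unconstrained pattern, `denv1_E.fasta` is accepted with the serotype captured as `denv1_E`; with the constrained pattern it is rejected, so only a rule written for gene files can claim it.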

Add genome postfix to reference files

9120d08

This is in preparation for having separate genome and gene (e.g. E, NS1) reference files.

Add results/genome subdirectory

c05d129

This is in preparation for nesting each gene's specific files in subdirectories (e.g. `results/E/tree.nwk`) as suggested in comment: nextstrain/private#102 (comment)

Parameterize "genome" in the phylogenetic pipeline

08855dc

In preparation for building "genome" and "E" intermediate and final files for the phylogenetic pipeline.

drop dengue in reference filenames

1f87711

fixup: dengue2 reference genbank

08c7066

Move gene annotation to top of CDS to match the other GenBank files (denv1, denv3, denv4)

Add rule for generating gene reference files

f5b7bf6

This generates the reference_serotype_gene.gb and reference_serotype_gene.fasta files for each serotype. These files can then be used by augur align, augur translate, and optionally nextclade align when building the gene trees.
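The core of producing a gene reference from a genome reference can be sketched as: find the feature for the requested gene, slice the sequence, and rebase the coordinates so the feature starts at 1. The code below is not newreference.py, which reads GenBank files; it is a simplified standard-library sketch in which the `Feature` dataclass and `extract_gene` function are hypothetical. It accepts either `gene` or `CDS` features and raises an error when the gene is missing, in line with the behavior described in this PR.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    kind: str   # feature type, e.g. "gene" or "CDS"
    name: str   # value of the /gene qualifier
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def extract_gene(genome: str, features: list[Feature], gene: str) -> tuple[str, Feature]:
    """Return the gene's sequence and its feature rebased to start at 1."""
    for feature in features:
        if feature.kind in ("gene", "CDS") and feature.name == gene:
            # Convert 1-based inclusive coordinates to a Python slice.
            sequence = genome[feature.start - 1 : feature.end]
            rebased = Feature(feature.kind, gene, 1, len(sequence))
            return sequence, rebased
    # Fail loudly instead of warning when the gene is absent.
    raise ValueError(f"gene {gene!r} not found in reference")
```

For a toy genome `"ccatgaaataggg"` with `Feature("CDS", "E", 3, 11)`, the call `extract_gene(genome, features, "E")` returns the nine bases `"atgaaatag"` and a feature spanning 1..9, matching the `CDS 1..1485` layout of the example record in the description.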

j23414 force-pushed the generate-gene-reference-files branch from 30b1d5a to f5b7bf6 on May 8, 2024 23:33

j23414 merged commit e720a96 into main May 9, 2024
41 checks passed

j23414 deleted the generate-gene-reference-files branch May 9, 2024 22:56

j23414 mentioned this pull request May 13, 2024

Use gene reference files to generate E gene trees #48

Merged

2 tasks

This was referenced May 23, 2024

Add E gene trees #18

Closed

Add E gene builds #17

Closed
